Introduction

Airbnb is an online vacation rental marketplace serving a community of hosts and travellers. The diagram below shows how Airbnb grew from two individuals who could not pay their rent in 2007 into a company that reached a US$10 billion valuation by 2014. In 2020, Airbnb went public at a valuation of up to US$47 billion.

Adapted from Adioma

According to Airbnb, it has millions of listings in over 220 countries and regions across 100,000 cities. The data generated provides rich information, including structured data (e.g. price and location) as well as unstructured data (e.g. reviews and listing descriptions). While there are statistical and analytic tools available to derive insights from this data, these tools are often subscription-based and require technical knowledge, which may not be available or accessible to everyone. Hence, this project aims to develop an interface that is concise, interactive, and user-friendly using R Shiny, so that data-based decisions can be made from an interactive GUI. The R Shiny app will cover exploratory data analysis, confirmatory data analysis, text mining, and predictive analysis.

This assignment is a sub-module of our final Shiny-based Visual Analytics Application (Shiny-VAA). In particular, it focuses on text mining using various R packages. The process is shown below:

Application Use Case

Our application can be used from the perspective of both hosts and guests.

Hosts: In 2014, Airbnb launched the Superhost programme to reward hosts with outstanding hospitality. Compared to regular hosts, a Superhost enjoys better earnings, more visibility, and access to exclusive rewards. To become a Superhost, the following criteria must be met:
  • 4.8 or higher overall rating based on reviews
  • Completed at least 10 stays in the past year, or 100 nights over at least 3 completed stays
  • Less than 1% cancellation rate, not including extenuating circumstances
  • Responds to 90% of new messages within 24 hours
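The criteria above can be expressed as a simple rule check. The following is a minimal sketch only: the function name, its arguments, and the 5-point rating scale are assumptions made for illustration (the InsideAirbnb review_scores_rating column is on a 0-100 scale and would need rescaling first).

```r
# Hypothetical helper; names and inputs are illustrative, not part of any Airbnb API.
# Rates are expressed as proportions (0.01 = 1%).
meets_superhost_criteria <- function(rating, stays, nights,
                                     cancellation_rate, response_rate) {
  # Either at least 10 stays, or 100 nights spread over at least 3 stays
  enough_stays <- stays >= 10 | (nights >= 100 & stays >= 3)
  rating >= 4.8 &
    enough_stays &
    cancellation_rate < 0.01 &
    response_rate >= 0.90
}

host_a <- meets_superhost_criteria(4.9, 12, 40, 0.00, 0.95)   # all criteria met
host_b <- meets_superhost_criteria(4.9, 2, 120, 0.00, 0.95)   # too few completed stays
```

Because the function uses vectorised operators, it could also be applied to whole columns of a host-level table.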

Guests: With over 60,000 members and 6,000 properties listed on the Airbnb website, choosing the right space can be a dilemma for users. The various modules in our dashboard allow both types of users to analyse Airbnb data according to their needs.

Data

InsideAirbnb provides tools and data for users to explore Airbnb. We will be using the following files:
  • listings.csv.gz: This dataset consists of 74 variables and 4256 data points.
  • reviews.csv.gz: This dataset consists of 6 variables and 52368 data points.

While the team has decided to use the latest set of data compiled on 27 January 2021, this report uses data compiled on 29 December 2020 for completeness.

Literature Review

This section reviews how similar analyses have been performed before. The focus is on identifying gaps where an interactive web approach and visual analytics techniques can enhance the user experience of these analysis techniques.

Airbnb data has been widely used for text mining with tools such as Python and R. In Python, the [Natural Language Toolkit](https://www.nltk.org/) (NLTK) has easy-to-use interfaces to over 50 corpora and lexical resources, as well as a wide range of text processing libraries for tokenisation, stemming, classification etc. Similarly, R has extensive packages such as tidyverse and Shiny, which allow for text mining and the building of interactive dashboards.

Zhang (2019) used text mining approaches, including content analysis and topic modelling with Latent Dirichlet Allocation (LDA), to examine over 1 million Airbnb reviews across 50,933 listings in the United States of America (USA). Kiatkawsin, Sutherland & Kim (2020) also used LDA to compare reviews between Hong Kong and Singapore. However, these articles do not provide visualisation of the methods used and are not interactive.

Kim’s Shiny Airbnb App provided an interactive dashboard for Exploratory Data Analysis (EDA), but left out reviews. [Ankit Pandey](https://github.com/ankit2web/Twitter-Sentiment-Analysis-using-R-Shiny-WebApp) provided a more comprehensive text analytics dashboard using word clouds and sentiment polarity, but it does not provide much interactivity.

To address the gaps above, the following sections outline the steps taken:

Submodules

Data Preparation

This section covers extracting, wrangling and preparing the input data required for the analysis, with a focus on appropriate tidyverse methods.

R Markdown

runtime: shiny was added to the document header to make the document interactive. The {r} header of a code chunk can be used to specify chunk options that control how the chunk is rendered into different output formats. echo=TRUE is set so that the code is printed in the rendered output. More details can be found in the R Markdown documentation.

Packages

To install multiple packages and load the libraries, run the following code chunk:

packages <- c("tidyverse","sf","tmap","crosstalk","leaflet","RColorBrewer","ggplot2","rgdal", "rgeos", "raster", "maptools","tmaptools","shiny","tidytext","wordcloud","wordcloud2","tm","ggthemes","igraph","ggmap","DT","reshape2","ggraph","topicmodels","quanteda","DataExplorer")

# Install any package that is missing, then attach it
for (p in packages){
  if (!require(p,character.only=TRUE)){
    install.packages(p)
  }
  library(p, character.only=TRUE)
}
## Loading required package: tidyverse
## Warning: package 'tidyverse' was built under R version 4.0.5
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.3     v purrr   0.3.4
## v tibble  3.1.0     v dplyr   1.0.5
## v tidyr   1.1.3     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.1
## Warning: package 'ggplot2' was built under R version 4.0.5
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
## Loading required package: sf
## Linking to GEOS 3.9.0, GDAL 3.2.1, PROJ 7.2.1
## Loading required package: tmap
## Loading required package: crosstalk
## Loading required package: leaflet
## Loading required package: RColorBrewer
## Loading required package: rgdal
## Loading required package: sp
## rgdal: version: 1.5-23, (SVN revision 1121)
## Geospatial Data Abstraction Library extensions to R successfully loaded
## Loaded GDAL runtime: GDAL 3.2.1, released 2020/12/29
## Path to GDAL shared files: C:/Users/joeyc/OneDrive/Documents/R/win-library/4.0/rgdal/gdal
## GDAL binary built with GEOS: TRUE 
## Loaded PROJ runtime: Rel. 7.2.1, January 1st, 2021, [PJ_VERSION: 721]
## Path to PROJ shared files: C:/Users/joeyc/OneDrive/Documents/R/win-library/4.0/rgdal/proj
## PROJ CDN enabled: FALSE
## Linking to sp version:1.4-5
## To mute warnings of possible GDAL/OSR exportToProj4() degradation,
## use options("rgdal_show_exportToProj4_warnings"="none") before loading rgdal.
## Overwritten PROJ_LIB was C:/Users/joeyc/OneDrive/Documents/R/win-library/4.0/rgdal/proj
## Loading required package: rgeos
## rgeos version: 0.5-5, (SVN revision 640)
##  GEOS runtime version: 3.8.0-CAPI-1.13.1 
##  Linking to sp version: 1.4-5 
##  Polygon checking: TRUE
## Loading required package: raster
## 
## Attaching package: 'raster'
## The following object is masked from 'package:dplyr':
## 
##     select
## The following object is masked from 'package:tidyr':
## 
##     extract
## Loading required package: maptools
## Checking rgeos availability: TRUE
## Loading required package: tmaptools
## Loading required package: shiny
## 
## Attaching package: 'shiny'
## The following object is masked from 'package:crosstalk':
## 
##     getDefaultReactiveDomain
## Loading required package: tidytext
## Loading required package: wordcloud
## Loading required package: wordcloud2
## Loading required package: tm
## Loading required package: NLP
## 
## Attaching package: 'NLP'
## The following object is masked from 'package:ggplot2':
## 
##     annotate
## Loading required package: ggthemes
## Loading required package: igraph
## 
## Attaching package: 'igraph'
## The following object is masked from 'package:raster':
## 
##     union
## The following object is masked from 'package:rgeos':
## 
##     union
## The following objects are masked from 'package:dplyr':
## 
##     as_data_frame, groups, union
## The following objects are masked from 'package:purrr':
## 
##     compose, simplify
## The following object is masked from 'package:tidyr':
## 
##     crossing
## The following object is masked from 'package:tibble':
## 
##     as_data_frame
## The following objects are masked from 'package:stats':
## 
##     decompose, spectrum
## The following object is masked from 'package:base':
## 
##     union
## Loading required package: ggmap
## Google's Terms of Service: https://cloud.google.com/maps-platform/terms/.
## Please cite ggmap if you use it! See citation("ggmap") for details.
## Loading required package: DT
## 
## Attaching package: 'DT'
## The following objects are masked from 'package:shiny':
## 
##     dataTableOutput, renderDataTable
## Loading required package: reshape2
## 
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
## 
##     smiths
## Loading required package: ggraph
## 
## Attaching package: 'ggraph'
## The following object is masked from 'package:sp':
## 
##     geometry
## Loading required package: topicmodels
## Warning: package 'topicmodels' was built under R version 4.0.5
## Loading required package: quanteda
## Warning: package 'quanteda' was built under R version 4.0.5
## Package version: 2.1.2
## Parallel computing: 2 of 8 threads used.
## See https://quanteda.io for tutorials and examples.
## 
## Attaching package: 'quanteda'
## The following object is masked from 'package:igraph':
## 
##     as.igraph
## The following objects are masked from 'package:tm':
## 
##     as.DocumentTermMatrix, stopwords
## The following objects are masked from 'package:NLP':
## 
##     meta, meta<-
## The following object is masked from 'package:utils':
## 
##     View
## Loading required package: DataExplorer
## Warning: package 'DataExplorer' was built under R version 4.0.5

Import Data

The read_csv() function reads the file at the given path and prints a column specification that gives the name and type of each column. As there are unnecessary columns, select() is used to retain only the columns needed in subsequent analysis.
  • The reviews file contains 52367 observations with 6 variables; 2 columns (listing_id and comments) are retained.
  • The listings file contains 4255 observations with 74 variables; 33 columns are retained.

reviews <- read_csv("C:/Users/joeyc/blog/_posts/2021-03-29-assignment/data/reviews.csv")%>% 
  dplyr::select(listing_id,comments)
## 
## -- Column specification --------------------------------------------------------
## cols(
##   listing_id = col_double(),
##   id = col_double(),
##   date = col_date(format = ""),
##   reviewer_id = col_double(),
##   reviewer_name = col_character(),
##   comments = col_character()
## )
listings <- read_csv("C:/Users/joeyc/blog/_posts/2021-03-29-assignment/data/listings.csv")  %>% 
  rename(listing_id=id) %>% 
  dplyr::select(-c(listing_url, scrape_id, last_scraped, name, picture_url,host_url, host_about,host_thumbnail_url, host_picture_url, host_listings_count, host_verifications,calendar_updated,first_review,last_review,license,neighborhood_overview,description,host_total_listings_count,host_has_profile_pic,availability_30,availability_60,availability_90,availability_365,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month,minimum_nights,maximum_nights,minimum_minimum_nights,maximum_minimum_nights,minimum_maximum_nights,maximum_maximum_nights,number_of_reviews_ltm,number_of_reviews_l30d,minimum_nights_avg_ntm,maximum_nights_avg_ntm,calendar_last_scraped,has_availability,instant_bookable))
## 
## -- Column specification --------------------------------------------------------
## cols(
##   .default = col_double(),
##   listing_url = col_character(),
##   last_scraped = col_date(format = ""),
##   name = col_character(),
##   description = col_character(),
##   neighborhood_overview = col_character(),
##   picture_url = col_character(),
##   host_url = col_character(),
##   host_name = col_character(),
##   host_since = col_date(format = ""),
##   host_location = col_character(),
##   host_about = col_character(),
##   host_response_time = col_character(),
##   host_response_rate = col_character(),
##   host_acceptance_rate = col_character(),
##   host_is_superhost = col_logical(),
##   host_thumbnail_url = col_character(),
##   host_picture_url = col_character(),
##   host_neighbourhood = col_character(),
##   host_verifications = col_character(),
##   host_has_profile_pic = col_logical()
##   # ... with 17 more columns
## )
## i Use `spec()` for the full column specifications.

Merge Data

right_join() is used to merge the listings and review files so that all rows from listings will be returned.

data <- right_join(reviews,listings,by="listing_id")
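To illustrate the join behaviour described above, here is a minimal sketch on toy tables. The table names and values are hypothetical stand-ins for the reviews and listings data.

```r
library(dplyr)

# Toy stand-ins for the reviews and listings tables (hypothetical values)
reviews_toy <- tibble(
  listing_id = c(1, 1, 2),
  comments   = c("great stay", "very clean", "noisy street")
)
listings_toy <- tibble(
  listing_id = c(1, 2, 3),
  room_type  = c("Private room", "Entire home/apt", "Shared room")
)

# right_join() keeps every row of the right-hand table (listings_toy);
# listing 3 has no review, so its comments value is NA
merged_toy <- right_join(reviews_toy, listings_toy, by = "listing_id")

# Listings without any review can be counted from the NA comments
n_unreviewed <- sum(is.na(merged_toy$comments))
```

Listings with several reviews appear once per review, while listings without reviews appear once with a missing comment. This may be why the merged data can have more rows than the reviews file.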

Save file

To write the merged data to a CSV file for future use, uncomment the following code (remove the leading #) and run it.

#write.csv(data,"data.csv")

View data

glimpse(data)
## Rows: 54,074
## Columns: 34
## $ listing_id                   <dbl> 49091, 50646, 50646, 50646, 50646, 50646,~
## $ comments                     <chr> "Fran was absolutely gracious and welcomi~
## $ host_id                      <dbl> 266763, 227796, 227796, 227796, 227796, 2~
## $ host_name                    <chr> "Francesca", "Sujatha", "Sujatha", "Sujat~
## $ host_since                   <date> 2010-10-20, 2010-09-08, 2010-09-08, 2010~
## $ host_location                <chr> "Singapore", "Singapore, Singapore", "Sin~
## $ host_response_time           <chr> "within a few hours", "a few days or more~
## $ host_response_rate           <chr> "100%", "0%", "0%", "0%", "0%", "0%", "0%~
## $ host_acceptance_rate         <chr> "N/A", "N/A", "N/A", "N/A", "N/A", "N/A",~
## $ host_is_superhost            <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,~
## $ host_neighbourhood           <chr> "Woodlands", "Bukit Timah", "Bukit Timah"~
## $ host_identity_verified       <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE,~
## $ neighbourhood                <chr> NA, "Singapore, Singapore", "Singapore, S~
## $ neighbourhood_cleansed       <chr> "Woodlands", "Bukit Timah", "Bukit Timah"~
## $ neighbourhood_group_cleansed <chr> "North Region", "Central Region", "Centra~
## $ latitude                     <dbl> 1.44255, 1.33235, 1.33235, 1.33235, 1.332~
## $ longitude                    <dbl> 103.7958, 103.7852, 103.7852, 103.7852, 1~
## $ property_type                <chr> "Private room in apartment", "Private roo~
## $ room_type                    <chr> "Private room", "Private room", "Private ~
## $ accommodates                 <dbl> 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,~
## $ bathrooms                    <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N~
## $ bathrooms_text               <chr> "1 bath", "1 bath", "1 bath", "1 bath", "~
## $ bedrooms                     <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,~
## $ beds                         <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,~
## $ amenities                    <chr> "[\"Washer\", \"Elevator\", \"Long term s~
## $ price                        <chr> "$80.00", "$80.00", "$80.00", "$80.00", "~
## $ number_of_reviews            <dbl> 1, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18~
## $ review_scores_rating         <dbl> 94, 91, 91, 91, 91, 91, 91, 91, 91, 91, 9~
## $ review_scores_accuracy       <dbl> 10, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9~
## $ review_scores_cleanliness    <dbl> 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 1~
## $ review_scores_checkin        <dbl> 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 1~
## $ review_scores_communication  <dbl> 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 1~
## $ review_scores_location       <dbl> 8, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9,~
## $ review_scores_value          <dbl> 8, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9,~

glimpse() does not present the data in a tabular format, hence datatable() from the DT package and kable() from the knitr package were considered. However:
  • datatable() does not work well with the FixedColumns, FixedHeader and Scroller extensions when coupled with Shiny. Hence, these specific functionalities are excluded.
  • kable() is not up to date with the current version of R and was not used.

## Warning in instance$preRenderHook(instance): It seems your data is too big
## for client-side DataTables. You may consider server-side processing: https://
## rstudio.github.io/DT/server.html

Exploratory Data Analysis (EDA)

EDA is conducted briefly to summarise the main characteristics of the data and to identify any potential issues.
  • ggplot2 is commonly used to map variables to aesthetics. However, plotting each graph individually is tedious, so it is not used here.
  • DataExplorer is similar to ProfileReport() from pandas-profiling in Python: it automates data handling and visualisation. This leaves more time to focus on understanding the underlying data and extracting insights.

To generate a report, uncomment the following code (remove the leading #) and run it.

#create_report(data)

The report shows the missing data profile, the univariate distributions, and bar charts with frequencies.

Text Mining

This section tests and prototypes the proposed text mining sub-module in R Markdown, rendered as a fully working HTML report.

A comparison of various packages is shown below:

Tidy Text Format

For easier and more effective data handling, this section outlines the steps to ensure tidy data, i.e. each variable is a column, each observation is a row, and each type of observational unit is a table. In other words, tidy text is a table with one token per row.

Data Types

  • To change host_since into date format, use as.Date().
  • To change the listing_id data type to categorical, use as.character().
data$host_since <- as.Date(data$host_since, format = '%d/%m/%Y')
data$listing_id <- as.character(data$listing_id)
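As a small illustration of the two conversions, the sketch below uses hypothetical values. Note that the format argument of as.Date() only matters for character input; the column specification above suggests read_csv() had already parsed host_since as a date.

```r
# Character input: the format string tells as.Date() how to parse the text
host_since_chr  <- c("20/10/2010", "08/09/2010")
host_since_date <- as.Date(host_since_chr, format = "%d/%m/%Y")

# Numeric identifiers become labels once converted to character,
# so they are no longer treated as quantities
listing_id_chr <- as.character(c(49091, 50646))
```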

Cleaning

  • To remove anything other than letters, whitespace and quotes, use gsub()
  • To change all letters to lowercase, use tolower()
  • To remove the dollar sign in price, use str_remove()
  • To remove the double quotes (displayed as \" when printed) in amenities, use gsub()
  • To keep only hosts whose host_location is one of the normalised Singapore values, use subset()
  • To drop rows
  • To remove non-ASCII characters including Chinese, Japanese, Korean etc., use iconv()
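# Replace anything that is not a letter, whitespace or quote with a space
data$comments <- gsub("[^[:alpha:][:space:]'\"]", " ",data$comments) %>% 
  tolower()
# Drop characters that cannot be represented in ASCII
data$comments <- iconv(data$comments,from="UTF-8",to="ASCII",sub="")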


data$price <- str_remove(string=data$price,pattern='\\$') %>% 
  as.numeric()
## Warning in str_remove(string = data$price, pattern = "\\$") %>% as.numeric():
## NAs introduced by coercion
data$host_response_rate <- gsub("%","",data$host_response_rate) 
data$amenities <- gsub('\"', "", data$amenities, fixed = TRUE)
data <- subset(data,host_location=="Singapore" | host_location=="Singapore, Singapore" | host_location=="SG")
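The cleaning steps can be tried out on a few toy values. This is a sketch under stated assumptions: the sample strings are hypothetical, and stripping the thousands separator from price is an assumed remedy for the "NAs introduced by coercion" warning, which is possibly caused by prices such as "$1,250.00".

```r
# Hypothetical review text and prices, for illustration only
comments_toy <- c("Great stay!!! 10/10", "caf\u00e9 near \u6771\u4eac station")
price_toy    <- c("$80.00", "$1,250.00")

# 1. Drop characters that cannot be represented in ASCII
ascii_only <- iconv(comments_toy, from = "UTF-8", to = "ASCII", sub = "")

# 2. Replace anything that is not a letter, whitespace or quote with a space,
#    then convert to lowercase
clean_comments <- tolower(gsub("[^[:alpha:][:space:]'\"]", " ", ascii_only))

# 3. Strip both the dollar sign and the thousands separator before converting,
#    so that values with a comma are not coerced to NA
price_num <- as.numeric(gsub("[$,]", "", price_toy))
```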

Stop Words

Additional stop words are added for the following reasons:
  • Reviews/comments consistently mention names, which are not relevant to the analysis.
  • stop_words is not comprehensive enough.
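One way to extend the stop word list is sketched below. The extra words are hypothetical examples chosen for illustration; the actual list used in the application is not shown in this report.

```r
library(dplyr)
library(tidytext)

# Hypothetical extra stop words: host names and generic terms seen in reviews
custom_stop_words <- tibble(
  word    = c("francesca", "sujatha", "airbnb", "singapore"),
  lexicon = "custom"
)

# Append them to the stop_words table that ships with tidytext,
# which has the same two columns (word, lexicon)
all_stop_words <- bind_rows(stop_words, custom_stop_words)
```

The combined table can then be passed to anti_join() in place of stop_words.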

Unnest tokens

  • unnest_tokens() is used to split a column into tokens, flattening the table into one token per row.
  • anti_join() removes stop words contained in the reviews/comments.

data_comments <- data %>% 
  dplyr::select(listing_id,comments,review_scores_rating,neighbourhood_cleansed,neighbourhood_group_cleansed)%>%
  # one lowercase word per row
  unnest_tokens(word,comments) %>% 
  # drop common stop words
  anti_join(stop_words)
## Joining, by = "word"
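The effect of tokenising and removing stop words can be seen on a toy example. The review texts below are hypothetical.

```r
library(dplyr)
library(tidytext)

# Two hypothetical reviews
reviews_toy <- tibble(
  listing_id = c("1", "2"),
  comments   = c("The room was clean and quiet", "Noisy street but the host helped")
)

tokens_toy <- reviews_toy %>%
  unnest_tokens(word, comments) %>%      # one lowercase word per row
  anti_join(stop_words, by = "word")     # drop common stop words such as "the"
```

The other columns (here listing_id) are carried along with each token, which is what allows later grouping by listing or neighbourhood.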

Frequency

count() or n() can be used to find the most common words in the reviews.

data_count <- data_comments %>% 
  group_by(word) %>% 
  summarise(frequency=n())
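As a sketch of the count() alternative mentioned above, the same frequency table can be produced and sorted in one step. The toy tokens are hypothetical.

```r
library(dplyr)

tokens_toy <- tibble(word = c("clean", "host", "clean", "quiet", "clean", "host"))

# count() is shorthand for group_by() + summarise(n = n());
# sort = TRUE puts the most frequent words first
word_freq <- tokens_toy %>%
  count(word, sort = TRUE, name = "frequency")

# Keep only the two most frequent words
top_words <- slice_head(word_freq, n = 2)
```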

Wordcloud

A word cloud provides an easy way to show how frequently a word appears in a corpus: the size of a word indicates how frequently it appears in the given text. Costs and benefits of the word cloud packages considered:

  • wordcloud was considered. However, it is limited in terms of flexibility, and it seemed to have issues rendering in Shiny.
  • wordcloud2 provides more flexibility through the choice of font, colour, rotation etc., and renderWordcloud2() works better for rendering in Shiny. However, in R Markdown, a rendering issue appeared for any content that follows the word cloud.
tidy_wordcloud <- data_count %>% 
  wordcloud2()
tidy_wordcloud

For a more impactful visualisation, the following were taken into consideration:

  • colour
#custom <- wordcloud2(data_count, color = rep_len(brewer.pal(11, "RdYlBu"), nrow(data_count)))
  • shape and letters
#custom_cloud <- wordcloud2(data_count, figPath = "airbnb.png")
#letterCloud(data_count, word = "AIRBNB")
  • sentiment lexicons (AFINN, Bing, NRC)

AFINN

afinn <- get_sentiments("afinn") 
afinn_sentiments <- data_comments %>% 
  inner_join(afinn) %>% 
  count(word,value,sort=TRUE) %>% 
  acast(word~value,value.var="n",fill=0) %>% 
  comparison.cloud()
## Joining, by = "word"
## Warning in brewer.pal(max(3, ncol(term.matrix)), "Dark2"): n too large, allowed maximum for palette Dark2 is 8
## Returning the palette you asked for with that many colors
## Warning in comparison.cloud(.): outstanding could not be fit on page. It will
## not be plotted.
## Warning in comparison.cloud(.): awesome could not be fit on page. It will not be
## plotted.
## Warning in comparison.cloud(.): disappoint could not be fit on page. It will not
## be plotted.
## Warning in comparison.cloud(.): complained could not be fit on page. It will not
## be plotted.
## Warning in comparison.cloud(.): emergency could not be fit on page. It will not
## be plotted.
## Warning in comparison.cloud(.): scared could not be fit on page. It will not be
## plotted.
## Warning in comparison.cloud(.): challenge could not be fit on page. It will not
## be plotted.
## Warning in comparison.cloud(.): chilling could not be fit on page. It will not
## be plotted.
## Warning in comparison.cloud(.): promised could not be fit on page. It will not
## be plotted.
## Warning in comparison.cloud(.): alarm could not be fit on page. It will not be
## plotted.
## Warning in comparison.cloud(.): struggle could not be fit on page. It will not
## be plotted.
## Warning in comparison.cloud(.): pleased could not be fit on page. It will not be
## plotted.
## Warning in comparison.cloud(.): beautifully could not be fit on page. It will
## not be plotted.
## Warning in comparison.cloud(.): impressed could not be fit on page. It will not
## be plotted.
## Warning in comparison.cloud(.): shocked could not be fit on page. It will not be
## plotted.
## Warning in comparison.cloud(.): blocked could not be fit on page. It will not be
## plotted.
## Warning in comparison.cloud(.): adorable could not be fit on page. It will not
## be plotted.
## Warning in comparison.cloud(.): refused could not be fit on page. It will not be
## plotted.
## Warning in comparison.cloud(.): steal could not be fit on page. It will not be
## plotted.
## Warning in comparison.cloud(.): disappointment could not be fit on page. It will
## not be plotted.
## Warning in comparison.cloud(.): misunderstanding could not be fit on page. It
## will not be plotted.
## Warning in comparison.cloud(.): useless could not be fit on page. It will not be
## plotted.
## Warning in comparison.cloud(.): desperately could not be fit on page. It will
## not be plotted.
## Warning in comparison.cloud(.): sucks could not be fit on page. It will not be
## plotted.
## Warning in comparison.cloud(.): grateful could not be fit on page. It will not
## be plotted.
## Warning in comparison.cloud(.): accused could not be fit on page. It will not be
## plotted.
## Warning in comparison.cloud(.): crying could not be fit on page. It will not be
## plotted.
## Warning in comparison.cloud(.): disadvantage could not be fit on page. It will
## not be plotted.
## Warning in comparison.cloud(.): shame could not be fit on page. It will not be
## plotted.
## Warning in comparison.cloud(.): discounted could not be fit on page. It will not
## be plotted.
## Warning in comparison.cloud(.): doubts could not be fit on page. It will not be
## plotted.
## Warning in comparison.cloud(.): forgotten could not be fit on page. It will not
## be plotted.
## Warning in comparison.cloud(.): solved could not be fit on page. It will not be
## plotted.
## Warning in comparison.cloud(.): admit could not be fit on page. It will not be
## plotted.
## Warning in comparison.cloud(.): leaked could not be fit on page. It will not be
## plotted.
## Warning in comparison.cloud(.): unbelievable could not be fit on page. It will
## not be plotted.
## Warning in comparison.cloud(.): unsure could not be fit on page. It will not be
## plotted.
## Warning in comparison.cloud(.): accident could not be fit on page. It will not
## be plotted.
## Warning in comparison.cloud(.): frustrating could not be fit on page. It will
## not be plotted.
## Warning in comparison.cloud(.): risk could not be fit on page. It will not be
## plotted.
## Warning in comparison.cloud(.): ruined could not be fit on page. It will not be
## plotted.
## Warning in comparison.cloud(.): shock could not be fit on page. It will not be
## plotted.
## Warning in comparison.cloud(.): skeptical could not be fit on page. It will not
## be plotted.
## Warning in comparison.cloud(.): deceiving could not be fit on page. It will not
## be plotted.
## Warning in comparison.cloud(.): keen could not be fit on page. It will not be
## plotted.
## Warning in comparison.cloud(.): isolated could not be fit on page. It will not
## be plotted.
## Warning in comparison.cloud(.): prevent could not be fit on page. It will not be
## plotted.
## Warning in comparison.cloud(.): complains could not be fit on page. It will not
## be plotted.
## Warning in comparison.cloud(.): fear could not be fit on page. It will not be
## plotted.
## Warning in comparison.cloud(.): inconsiderate could not be fit on page. It will
## not be plotted.
## Warning in comparison.cloud(.): lagged could not be fit on page. It will not be
## plotted.
## Warning in comparison.cloud(.): lonely could not be fit on page. It will not be
## plotted.
## Warning in comparison.cloud(.): regrets could not be fit on page. It will not be
## plotted.
## Warning in comparison.cloud(.): screaming could not be fit on page. It will not
## be plotted.
## Warning in comparison.cloud(.): weary could not be fit on page. It will not be
## plotted.
## Warning in comparison.cloud(.): inviting could not be fit on page. It will not
## be plotted.
## Warning in comparison.cloud(.): solution could not be fit on page. It will not
## be plotted.
## Warning in comparison.cloud(.): forced could not be fit on page. It will not be
## plotted.

afinn_count <- data_comments %>% 
  inner_join(afinn) %>% 
  count(word,value) %>% 
  # keep only words that occur more than 500 times
  filter(n>500) %>% 
  mutate(word=reorder(word,n)) %>% 
  ggplot(aes(word,n))+
  geom_col()+
  coord_flip()
## Joining, by = "word"
afinn_count
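Beyond word-level counts, AFINN-style scores could also be aggregated per listing. The sketch below is only an illustration: it uses a hypothetical mini-lexicon with the same column layout as get_sentiments("afinn") (word, value) rather than the real lexicon, and hypothetical tokens.

```r
library(dplyr)

# Hypothetical mini-lexicon with illustrative scores (columns: word, value)
lexicon_toy <- tibble(
  word  = c("great", "clean", "dirty", "noisy"),
  value = c(3, 2, -2, -1)
)

# Hypothetical tokens in the same shape as data_comments
tokens_toy <- tibble(
  listing_id = c("1", "1", "1", "2", "2"),
  word       = c("great", "clean", "host", "dirty", "noisy")
)

listing_sentiment <- tokens_toy %>%
  inner_join(lexicon_toy, by = "word") %>%   # words without a score are dropped
  group_by(listing_id) %>%
  summarise(score = sum(value), n_scored = n())
```

A positive score suggests that positively scored words dominate the reviews of a listing, although a simple sum ignores negation and context.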

BING

## Joining, by = "word"

Note: the word cloud may contain offensive words; readers are asked to self-censor.

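# Load the Bing lexicon (positive/negative labels) bundled with tidytext
bing <- get_sentiments("bing")
bing_count <- data_comments %>% 
  inner_join(bing) %>% 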
  count(word,sentiment) %>% 
  filter(n>500) %>% 
  mutate(n=ifelse(sentiment=="negative",-n,n)) %>% 
  mutate(word=reorder(word,n)) %>% 
  ggplot(aes(word,n,fill=sentiment))+
  geom_col()+
  coord_flip()
## Joining, by = "word"
bing_count

NRC

## Joining, by = "word"
## Warning in brewer.pal(max(3, ncol(term.matrix)), "Dark2"): n too large, allowed maximum for palette Dark2 is 8
## Returning the palette you asked for with that many colors
## Warning in comparison.cloud(.): toilet could not be fit on page. It will not be
## plotted.
## Warning in comparison.cloud(.): money could not be fit on page. It will not be
## plotted.
## Warning in comparison.cloud(.): late could not be fit on page. It will not be
## plotted.
## Warning in comparison.cloud(.): peaceful could not be fit on page. It will not
## be plotted.
## Warning in comparison.cloud(.): interior could not be fit on page. It will not
## be plotted.
## Warning in comparison.cloud(.): leave could not be fit on page. It will not be
## plotted.
## Warning in comparison.cloud(.): feeling could not be fit on page. It will not be
## plotted.
## Warning in comparison.cloud(.): neighborhood could not be fit on page. It will
## not be plotted.
## Warning in comparison.cloud(.): disappointed could not be fit on page. It will
## not be plotted.
## Warning in comparison.cloud(.): inconvenient could not be fit on page. It will
## not be plotted.
## Warning in comparison.cloud(.): sweet could not be fit on page. It will not be
## plotted.
## Warning in comparison.cloud(.): hanging could not be fit on page. It will not be
## plotted.
## Warning in comparison.cloud(.): instructions could not be fit on page. It will
## not be plotted.
## Warning in comparison.cloud(.): treat could not be fit on page. It will not be
## plotted.
## Warning in comparison.cloud(.): explore could not be fit on page. It will not be
## plotted.
## Warning in comparison.cloud(.): honest could not be fit on page. It will not be
## plotted.
## Warning in comparison.cloud(.): terrible could not be fit on page. It will not
## be plotted.
## Warning in comparison.cloud(.): sin could not be fit on page. It will not be
## plotted.
## Warning in comparison.cloud(.): messy could not be fit on page. It will not be
## plotted.
## Warning in comparison.cloud(.): hesitation could not be fit on page. It will not
## be plotted.

## Joining, by = "word"

tm

Create corpus

# Pass the character vector of comments directly so that each review becomes one document
corpus_review <- Corpus(VectorSource(data$comments))

Clean corpus

  • Stemming algorithms work by cutting off the beginning or end of a word, using a list of common prefixes and suffixes found in inflected words. However, stemming changes words such as "earlier" to "earli" and "checking" to "checkin", as shown above. As such, this step was excluded.
  • Lemmatization, on the other hand, uses morphological analysis to reduce each word to its dictionary form (lemma).
corpus_review <- tm_map(corpus_review, tolower)
## Warning in tm_map.SimpleCorpus(corpus_review, tolower): transformation drops
## documents
corpus_review <- tm_map(corpus_review, removePunctuation)
## Warning in tm_map.SimpleCorpus(corpus_review, removePunctuation): transformation
## drops documents
corpus_review <- tm_map(corpus_review, removeWords, stopwords("english"))
## Warning in tm_map.SimpleCorpus(corpus_review, removeWords,
## stopwords("english")): transformation drops documents
corpus_review <- tm_map(corpus_review, removeWords, c("also", "is", "in", "to", "with"))
## Warning in tm_map.SimpleCorpus(corpus_review, removeWords, c("also", "is", :
## transformation drops documents
# Inspect the first document if needed:
# corpus_review[[1]][1]
# Build a document-term matrix and sum each term's count across all documents
review_tdm <- DocumentTermMatrix(corpus_review)
review <- as.data.frame(colSums(as.matrix(review_tdm)))
review <- rownames_to_column(review)
colnames(review) <- c("term", "num")
review <- arrange(review, desc(num))
# View the 10 most common words
# head(review, 10)
# barplot(review$num[1:20], names.arg = review$term[1:20], col = "steelblue", las = 2)
# Word cloud of the 50 most frequent terms
wordcloud(review$term, review$num, max.words = 50, colors = "red")

TF-IDF

TF-IDF is short for Term Frequency-Inverse Document Frequency. Term frequency (TF) measures how often a word appears within a document. Inverse document frequency (IDF) down-weights words that appear in many documents of the collection; it is commonly computed as the logarithm of the total number of documents divided by the number of documents containing the word. A word that appears frequently within a document but rarely across the collection will have a high TF-IDF score, marking it as characteristic of that document.
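As a rough illustration of the definition above, the sketch below computes TF-IDF on a toy table of word counts using bind_tf_idf() from tidytext. The neighbourhood names, words, and counts are made up for this example and are not taken from the Airbnb data; each neighbourhood is treated as one "document".

```r
library(dplyr)
library(tidytext)

# Toy word counts per neighbourhood (made-up numbers, for illustration only)
word_counts <- tibble::tribble(
  ~neighbourhood, ~word,   ~n,
  "Central",      "clean", 4,
  "Central",      "mrt",   4,
  "East",         "clean", 3,
  "East",         "beach", 1
)

# bind_tf_idf() adds tf, idf and tf_idf columns.
# tf  = count of the word / total words in the document
# idf = ln(number of documents / number of documents containing the word)
scores <- word_counts %>%
  bind_tf_idf(word, neighbourhood, n) %>%
  arrange(desc(tf_idf))

scores
```

In this toy table, "clean" appears in every neighbourhood, so its IDF (and therefore its TF-IDF) is zero, while words unique to one neighbourhood receive positive scores.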

## `summarise()` has grouped output by 'neighbourhood_group_cleansed'. You can override using the `.groups` argument.
## Joining, by = "neighbourhood_group_cleansed"
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Topic Modelling (LDA)

Latent Dirichlet allocation (LDA) is an example of a topic modelling algorithm, based on two principles:

  1. Every document is a mixture of topics. For example, document A may be 90% about location and 10% about the host’s hospitality, whereas document B may be 30% about location and 70% about the host’s hospitality.
  2. Every topic is a mixture of words. For instance, based on the Airbnb data, one topic could be about cleanliness and another about amenities.

To set the number of topics, set k = x, where x is a whole number giving the desired number of topics.
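A minimal sketch of how k is set, assuming the topicmodels package is used for LDA and tidytext for reshaping the results (the original code chunk is not shown, so this may differ from the actual implementation). The review IDs, words, and counts are hypothetical.

```r
library(dplyr)
library(tidytext)
library(topicmodels)

# Toy per-review word counts (hypothetical data, for illustration only)
toy_counts <- tibble::tribble(
  ~review_id, ~word,      ~n,
  "r1",       "clean",    3,
  "r1",       "tidy",     2,
  "r1",       "spotless", 1,
  "r2",       "clean",    2,
  "r2",       "tidy",     1,
  "r2",       "wifi",     1,
  "r3",       "wifi",     3,
  "r3",       "pool",     2,
  "r3",       "kitchen",  1,
  "r4",       "pool",     2,
  "r4",       "kitchen",  2,
  "r4",       "clean",    1
)

# Cast the tidy counts into a document-term matrix
toy_dtm <- cast_dtm(toy_counts, review_id, word, n)

# Fit LDA with k = 2 topics; the seed makes the result reproducible
toy_lda <- LDA(toy_dtm, k = 2, control = list(seed = 1234))

# beta: per-topic word probabilities (every topic is a mixture of words)
topic_words <- tidy(toy_lda, matrix = "beta")

# gamma: per-document topic proportions (every document is a mixture of topics)
review_topics <- tidy(toy_lda, matrix = "gamma")
```

The gamma values for each review sum to 1, which corresponds to the "90% location, 10% hospitality" style of description used above.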

Geospatial Analysis

The shapefiles (.shp) were obtained from data.gov, using the planning boundary layer without sea.

Data Preparation

region_data <-data_comments %>% 
  group_by(neighbourhood_group_cleansed) %>% 
  inner_join(afinn) %>% 
  count(word,value) %>% 
  mutate(score=value*n) %>%
  group_by(neighbourhood_group_cleansed) %>% 
  summarise(mean_score=mean(score))
## Joining, by = "word"
datatable(region_data)
subregion_data <-data_comments %>% 
  group_by(neighbourhood_cleansed) %>% 
  inner_join(afinn) %>% 
  count(word,value) %>% 
  mutate(score=value*n) %>%
  group_by(neighbourhood_cleansed) %>% 
  summarise(mean_score=mean(score))
## Joining, by = "word"
datatable(subregion_data)

Read Shape File

mpsz <- st_read(dsn = "data/geospatial", 
                layer = "MP14_SUBZONE_NO_SEA_PL")%>%
  group_by(PLN_AREA_N) %>%
  summarise(geometry = sf::st_union(geometry))
## Reading layer `MP14_SUBZONE_NO_SEA_PL' from data source `C:\Users\joeyc\ISSS608\ourshinyPET\content\english\blog\text-mining\data\geospatial' using driver `ESRI Shapefile'
## Simple feature collection with 323 features and 15 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: 2667.538 ymin: 15748.72 xmax: 56396.44 ymax: 50256.33
## Projected CRS: SVY21

Join Data

airbnb_clean_region <- right_join(mpsz, region_data,
                                  by = c("PLN_AREA_N" = "neighbourhood_group_cleansed"))

airbnb_clean_subregion <- right_join(mpsz, subregion_data,
                                     by = c("PLN_AREA_N" = "neighbourhood_cleansed"))

Create Map

map <- tm_shape(mpsz)+
  tm_fill("PLN_AREA_N",title="Region",palette = "Reds")+
  tm_borders()+
  tm_layout(legend.outside=TRUE, legend.outside.position="right")
map
## Warning: Number of levels of the variable "PLN_AREA_N" is 55, which is
## larger than max.categories (which is 30), so levels are combined. Set
## tmap_options(max.categories = 55) in the layer function to show all levels.
## Some legend labels were too wide. These labels have been resized to 0.38. Increase legend.width (argument of tm_layout) to make the legend wider and therefore the labels larger.

## tmap mode set to interactive viewing

To allow better visualisation, several factors were taken into consideration:

  • Colour: RColorBrewer can be used to select a preferred colour palette that suits the theme of the topic. A sequential colour scheme, e.g. from red to white, can be generated with the command brewer.pal.

  • Legend: To change the default legend title, use tm_fill(variable, title = "Region"). To reposition the legend, use tm_layout(legend.position = c("right", "bottom")). To place the legend outside the map, use tm_layout(legend.outside = TRUE, legend.outside.position = "right"). To set a title for the map, use tm_layout(title = "Title").
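The colour and legend options above can be combined as in the sketch below. It assumes the same tmap syntax used in this post (tm_fill with a palette argument) and, so that it runs on its own, uses the World dataset bundled with tmap rather than the Singapore shapefile; the HPI column and the titles are only placeholders.

```r
library(tmap)
library(RColorBrewer)

# Example polygons shipped with tmap (stand-in for the planning-area shapefile)
data("World")

# Five-class sequential palette from the "Reds" scheme
reds <- brewer.pal(5, "Reds")

map <- tm_shape(World) +
  tm_fill("HPI", title = "Happy Planet Index", palette = reds) +  # custom legend title and palette
  tm_borders() +
  tm_layout(title = "Example map",                                # map title
            legend.outside = TRUE,                                # legend drawn outside the map frame
            legend.outside.position = "right")
```

Printing map draws the plot; swapping World for the joined Airbnb data and "HPI" for mean_score would give the sentiment map.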

Storyboard

A storyboard was prepared to guide the design of the sub-module.

The Shiny app sub-module will have tabs for:

  • Data
  • Token Frequency/Wordcloud
  • Sentiment Analysis
  • Topic Modeling
  • Network Analysis
  • Geospatial Analysis
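One possible skeleton for this tab layout is sketched below using navbarPage() from shiny. The app title and the empty tab bodies are placeholders; the final application may use a different layout function (for example shinydashboard).

```r
library(shiny)

# Tab names taken from the storyboard
tab_names <- c("Data", "Token Frequency/Wordcloud", "Sentiment Analysis",
               "Topic Modeling", "Network Analysis", "Geospatial Analysis")

# One empty tabPanel per storyboard tab; content would be filled in later
ui <- navbarPage(
  title = "Text Mining",   # placeholder title
  tabPanel(tab_names[1]),
  tabPanel(tab_names[2]),
  tabPanel(tab_names[3]),
  tabPanel(tab_names[4]),
  tabPanel(tab_names[5]),
  tabPanel(tab_names[6])
)

# No server logic yet; each tab would get its own outputs here
server <- function(input, output, session) {}

# Uncomment to launch the app interactively
# shinyApp(ui, server)
```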

Install Shiny Packages

shinypackages <- c("shiny","shinydashboard","DT")

for (p in shinypackages){
  if (!require(p,character.only=T)){
    install.packages(p)
  }
  library(p, character.only=T)
}
## Loading required package: shinydashboard
## 
## Attaching package: 'shinydashboard'
## The following object is masked from 'package:graphics':
## 
##     box

Buttons

Various Widgets are considered.

Data

To allow flexibility, the Shiny application will let users upload files for further visualisation. Steps to upload a file:

  1. Select “Browse...”
  2. Locate and select the merged CSV file to be uploaded
  3. After uploading, the variables will be auto-populated

Variables to be shown in the data table can be chosen in two ways:

  1. Typing to search for a variable
  2. Scrolling and selecting
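The upload flow described above could look roughly like the following sketch. The input IDs (file, vars, table) and labels are hypothetical; the multi-select input supports both typing to search and scrolling to select.

```r
library(shiny)
library(DT)

ui <- fluidPage(
  # Step 1-2: "Browse..." button to locate and upload the merged CSV file
  fileInput("file", "Upload merged CSV file", accept = ".csv"),
  # Step 3: variable picker, filled in once a file has been uploaded
  selectInput("vars", "Variables to show", choices = NULL, multiple = TRUE),
  DTOutput("table")
)

server <- function(input, output, session) {
  # Read the uploaded file; req() waits until a file is available
  uploaded <- reactive({
    req(input$file)
    read.csv(input$file$datapath, stringsAsFactors = FALSE)
  })

  # Auto-populate the variable picker with the column names
  observeEvent(uploaded(), {
    updateSelectInput(session, "vars",
                      choices = names(uploaded()),
                      selected = names(uploaded()))
  })

  # Show only the selected variables in the data table
  output$table <- renderDT({
    req(input$vars)
    datatable(uploaded()[, input$vars, drop = FALSE])
  })
}

# Uncomment to launch the app interactively
# shinyApp(ui, server)
```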

Token Frequency

In the second tab, Token Frequency, the word cloud is drawn based on the following controls:

  1. Number of words: The slider allows input of single values and ranges.
sliderTextInput(
   inputId="Id096",
   label="Choose a range:", 
   choices=c(1, 10, 100, 500, 1000),
   grid=TRUE
)
  2. Stopwords: The text input lets users add their own stop words.
textInput(
   inputId="Id097",      # placeholder id; it must differ from the slider's inputId
   label="Input Text:"   # textInput() has no grid argument, so it is omitted
)
  3. Theme of word cloud: The wordcloud2 package provides WCtheme() for applying a theme to a word cloud.
tidy_wordcloud + WCtheme(1)
tidy_wordcloud + WCtheme(2)
tidy_wordcloud + WCtheme(3)
tidy_wordcloud + WCtheme(2) + WCtheme(3)
  4. N-gram (unigram, bigram, trigram)
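For the n-gram option, a minimal sketch using unnest_tokens() from tidytext is shown below. The helper name count_ngrams and the two sample comments are hypothetical; n = 1, 2 or 3 would correspond to the three choices offered in the app.

```r
library(dplyr)
library(tidytext)

# Two made-up review comments, for illustration only
reviews <- tibble::tibble(
  id = 1:2,
  comments = c("The host was very friendly",
               "Very friendly host and clean room")
)

# Hypothetical helper: split comments into n-grams and count them
count_ngrams <- function(data, n) {
  data %>%
    unnest_tokens(ngram, comments, token = "ngrams", n = n) %>%
    count(ngram, sort = TRUE)
}

bigrams <- count_ngrams(reviews, 2)
bigrams
```

Here "very friendly" is the only bigram shared by both comments, so it appears at the top of the table with a count of 2.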

Issue with GitHub

While pushing to GitHub, I encountered the following error message:

Error in git2r::add(repo, paths) : Error in ‘git2r_index_add_all’: the index is locked; this might be due to a concurrent or crashed process
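Git reports a locked index when an index.lock file exists inside the repository's .git folder, which can be left behind by a crashed process. One possible workaround, assuming no other Git process is still running, is to delete that stale file. The helper below is a hypothetical sketch and not part of git2r.

```r
# Hypothetical helper: remove a stale Git index lock from a repository.
# Only use this when no other Git process is still working on the repository.
remove_git_lock <- function(repo = ".") {
  lock_file <- file.path(repo, ".git", "index.lock")
  if (file.exists(lock_file)) {
    file.remove(lock_file)  # TRUE if the lock file was deleted
  } else {
    FALSE                   # nothing to remove
  }
}
```

Calling remove_git_lock() from the project folder and then retrying the push may clear the error.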

Annex

Kiatkawsin, K., Sutherland, I., & Kim, J. (2020). A Comparative Automated Text Analysis of Airbnb Reviews in Hong Kong and Singapore Using Latent Dirichlet Allocation. Sustainability (Basel, Switzerland), 12(16), 6673. https://doi.org/10.3390/su12166673

Zhang, J. (2019). What’s yours is mine: exploring customer voice on Airbnb using text-mining approaches. The Journal of Consumer Marketing, 36(5), 655–665. https://doi.org/10.1108/JCM-02-2018-2581